Clusters of processors are interconnected as an emulation
engine such that the setup and storing of results are done in
parallel, but the output of one evaluation unit is connected to
the input of the next evaluation unit. A set of 'cascade'
connections provides access to the intermediate values. By
tapping intermediate values from one processor, and feeding
them to the next, significant emulation speedup is achieved.
CLUSTERED PROCESSORS IN AN EMULATION ENGINE

FIELD OF THE INVENTION

This invention relates to processor-based emulation engines.

TRADEMARKS

IBM is a registered trademark of International Business
Machines Corporation, Armonk, N.Y.

BACKGROUND

Hardware emulators are programmable devices used in the
verification of logic designs. A common method of logic
design verification is to use processors to emulate the design.
These processor-based emulators sequentially evaluate
combinatorial logic levels, starting at the inputs and proceeding
to the outputs. Each pass through the entire set of logic levels
is called a Target Cycle; the evaluation of each individual
logic level is called an Emulation Step.

Speed is a major selling factor in the emulator market, and
is a well known problem. The purpose of this invention is to
significantly improve our emulator's speed.

Our invention is an improvement over that disclosed in
U.S. Pat. No. 5,551,013, "Multiprocessor for Hardware
Emulation," issued to Beausoleil, et al., where a software-driven
multiprocessor emulation system with a plurality of
emulation processors connected in parallel in a module has
one or more modules of processors to make up an emulation
system. Our current processor-based emulator consists of a
large number of interconnected processors, each with an
individual control store, as described in detail in the U.S.
Pat. No. 5,551,013. It would be desirable to improve the
speed of this emulator.

While not suitable for our purposes, but for completeness,
we note that FPGA-based emulation systems exist that
achieve high speeds for small models. However, FPGA-based
emulators are inherently I/O bound, and therefore
perform poorly with large models. In general, the problem of
high-speed emulation of large models had not been solved.

SUMMARY OF THE INVENTION

We have increased the processor-based emulation speed
by increasing the amount of work done during each emulation
step. In the original emulator, an emulation step
consisted of a setup phase, an evaluation phase, and a
storage phase. With this invention, clusters of processors are
interconnected such that the evaluation phases can be cascaded.
All processors in a cluster perform the setup in
parallel. This setup includes routing of the data through
multiple evaluation units for the evaluation phase. (For most
efficient operation, the input stack and data stack of each
processor must be stored in shared memory within each
cluster.) Then, all processors perform the storage phase,
again in parallel. The net result is multiple cascaded evaluations
performed in a single emulation step. A key feature of
the invention is that every processor in a cluster can access
the input and data stacks of every other processor in the
cluster.

DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates how a processor reads a logic function
and associated operands from the input and data store,
performs the operation, and writes the results, all in a single
step.

FIG. 2 illustrates how, in accordance with the invention,
clusters of processors share input and data stacks and are
interconnected such that the setup and storing of results is
done in parallel, and an option is available to route the output
of one evaluation unit, via 'cascade' connections, to the
input of the next evaluation unit.

FIG. 3 illustrates a single processor, with times listed as
D1 through D4, showing how the total step time is equal to
the sum D1+D2+D3+D4.

FIG. 4 shows four clustered processors and their shared
input and data stacks, with the signal 'cascading' through the
four function evaluation units, and with the total step time
equal to the same sum D1+D2+D3+D4.

FIG. 5 illustrates three methods of routing thirteen signals
through four function evaluation units, with the total step
time in each case equal to the same sum D1+D2+D3+D4.

DETAILED DESCRIPTION OF THE INVENTION

Before turning to the detailed description of our
invention, we would note that one method of speedup is to
evaluate independent logic paths in parallel. A parallel
system may consist of hierarchically arranged processors:
multiprocessor modules on multi-module boards, in a
multi-board system. Synchronization is achieved by delaying the
start of the next target cycle until the completion of all paths.
This means that the effective emulation speed is determined
by the time required to evaluate the longest path (called the
critical path).

For evaluation of independent logic paths in parallel, we
can describe our improvement over that disclosed in U.S.
Pat. No. 5,551,013, "Multiprocessor for Hardware
Emulation," issued to Beausoleil, et al. (fully incorporated
herein by this reference) where a software-driven
multiprocessor emulation system with a plurality of emulation
processors connected in parallel in a module has one or more
modules of processors to make up an emulation system. To
illustrate, refer to FIG. 1 of U.S. Pat. No. 5,551,013, which
shows an emulation chip, called a module here, having
multiple (e.g. 64) processors. All processors within the
module are identical and have the internal structure shown
in FIG. 1. The sequencer and the interconnection network
occur only once in a module. Control stores hold a
program created by an emulation compiler for a specified
processor. The stacks hold data and inputs previously
generated and are addressed by fields in a corresponding control
word to locate the bits for input to the logic element. During
each step of the sequencer an emulation processor emulates
a logic function according to the emulation program. The
data flow control interprets the current control word to route
and latch data within the processor. The node-bit-out signal
from a specified processor is presented to the interconnection
network where it is distributed to each of the multiplexors
(one for each processor) of the module. The node
address field in the control word allows a specified processor
to select for its node-bit-in signal the node-bit-out signal
from any of the processors within its module. The node bit
is stored in the input stack on every step. During any
operation the node-bit-out signal of a specified processor
may be accessed by none, one, or all of the processors within
the module.

Data routing within each processor's data flow and
through the interconnection network occurs independently
of and overlaps the execution of the logic emulation function
in each processor. Each control store stores control words
executed sequentially under control of the sequencer and
program steps in the associated module. Each revolution of
the sequencer causes the step value to advance from zero to
a predetermined maximum value and corresponds to one
target clock cycle for the emulated design. A control word in
the control store is simultaneously selected during each step
of the sequencer. A logic function operation is defined by
each control word. Thus, we have provided in FIG. 1 a
software-driven multiprocessor emulation system which
uses in a module a plurality of emulation processors. Each
of these emulation processors has an execution unit for
processing multiple types of logic gate functions. Each
emulation processor switches from a specified one logic gate
function to a next logic gate function in a switched-emulation
sequence of different gate functions. The
switched-emulation sequence of each of the processors thus
can emulate a subset of gates in a hardware arrangement in
which logic gates are of any type that the emulation
processors functionally represent for a sequence of clock cycles.
The processors are coupled by a like number of multiplexors
having outputs respectively connected to the emulation
processors of a module and having inputs respectively
connected to each of the other emulation processors. The bus
connected to the multiplexors enables an output from any
emulation processor to be transferred to an input of any other
of the emulation processors. In accordance with our
improvement, it will be understood that we have provided
clusters of processors which are interconnected as an
emulation engine such that the setup and storing of results is
done in parallel, but the output of one evaluation unit is
made available as the input of the next evaluation unit. For
this purpose we enabled processors to share input and data
stacks, and have provided a set of 'cascade' connections
which provides access to the intermediate values as we will
describe. By tapping intermediate values from one
processor, and feeding them to the next, significant
emulation speedup is achieved.

The embedded control store in each of the emulation
processors stores logic-representing signals for controlling
operations of the emulation processor. The emulation
engine's processor evaluation unit illustrated by FIG. 1 is
provided with an embedded data store for each of the
emulation processors which receives data generated by the
very same emulation processor under control of software
signals stored in the embedded control store in the same
emulation processor. It is the controls that are used to
transmit data from any emulation processor through a
connected multiplexor under control of software signals stored
in the embedded control store to control computational
emulation of the hardware arrangement by operation of the
plurality of processors which form evaluation units of the
emulation engine under software control in accordance with
the following description of FIGS. 2, 3, 4, and 5.

An execution unit in each processor's emulation unit
includes a table-lookup unit for emulating any type of logic
gate function and a connection from the output of each
processor to a multiplexor input with every other processor
in a module. Each processor embeds a control store to store
software logic-representing signals for controlling operations
of each processor. Also in the prior system a data store
is embedded in each processor to receive data generated
under control of the software signals in the control store. The
parallel processors on each module have a module input and
a module output from each processor. The plurality of
modules have their module outputs interconnected to module
inputs of all other modules. A sequencer synchronously
cycles the processors through mini-cycles on all modules.
Logic software drives all of the processors in the emulation
system to emulate a complex array of Boolean logic, which
may represent all of the logic gates in a complex logic
semiconductor chip or system. Each cycle of processing
may control the emulation of a level of logic being verified
by the single emulation processor illustrated in FIG. 1,
having our 'cascade' connection control facility improvement.

For a more detailed understanding of our invention, it
should be understood that at each emulation step, a processor
reads a logic function and associated operands from the
data store, performs the operation, and writes the results as
illustrated by FIG. 2 (consider the first stage as illustrative
here). The internal clock frequency of the emulator is given
as (1/t), where t is the time taken for a single step. In general,
if a processor is designated to evaluate the critical path with
n logic levels, then the time taken will be (n*t). (This
assumes that the evaluation of the logic levels is not delayed
by the availability of the input signals. Sharing input and
data stacks within the clusters greatly enhances the
probability that signals are available when needed.) The effective
speed of the emulator, measured in cycles per unit time, is
given as 1/(n*t). As our goal is to make the emulator run as
fast as possible, we have developed the system as illustrated.
Where, as stated above, t represents the time taken for
a single emulation step, our invention enables, with the
ability to evaluate four logic functions in the same time t, a
400% speedup by enabling each processor to evaluate
effectively in parallel four logic functions (four stages are
shown in FIG. 2) in this same time t.

Before we developed our current emulator, the clock
granularity was the time for one processor to evaluate one
logic function. We have found that signal propagation times
and power consumption considerations determine the step
time t. This time t is greater than or equal to D1+D2+D3+D4.

The sum, D1+D2+D3+D4, includes reading from the
data store, setting up the operation, performing the
evaluation, and storing the results. Note that setup can
include gathering data from other processors on the same
module or on other modules. We determined that for our
planned interconnection networks, the setup times dominate
the sum; there is a large differential between the amount of
time spent during setup versus the amount of time spent
during the logic evaluation.

We have provided, in accordance with our invention, the
ability to exploit this time differential by tapping the results
from one processor and feeding them to the next, within the
step time t. Thus, when clusters of processors are
interconnected such that the setup and storing of results is done in
parallel, as illustrated by FIG. 2, the output of one evaluation
unit has the option of being connected to the input of the next
evaluation units. We have, in accordance with our invention,
a set of 'cascade' connections which provides access to these
intermediate values.

FIG. 3 shows a single processor, with the times listed as
D1 through D4; the relative times are not drawn to scale. The
total step time t is equal to the sum D1+D2+D3+D4. Now
when we illustrate our invention in accordance with FIG. 4
with four clustered processors arranged with the signal
flowing through all four function evaluation units, here
again, the total step time is D1+D2+D3+D4. Note that the
number of evaluations that can be performed within a step
is limited by the relative times of D1 and D3. The
connections between the processors in FIG. 4 are through the
cascade connections shown in FIG. 2. To visualize the
speedup achieved through this invention, consider a logic
path with 18 levels, A through R. In our current emulator,
each evaluation would take a single step, for a total time of
18 steps. With this invention, levels A through D would be
distributed among the four processors in a cluster for
evaluation in the first step. E through H would be distributed to
the same four processors for evaluation in the second step.
I through L would be evaluated in the third step, M through
P in the fourth step, and Q and R in the fifth step. The
evaluation of the entire path would be reduced from 18 to 5
steps.
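
The arithmetic of this 18-level example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `group_levels` and `effective_speed` are hypothetical, the step time t is taken to be the same with and without cascading (as the text states for FIGS. 3 and 4), and every level is assumed to depend on the one before it.

```python
import math
import string

def group_levels(levels, cluster_size):
    """Split an ordered list of logic levels into per-step groups.

    Hypothetical helper: each group holds the levels that one cluster
    evaluates, cascaded, within a single emulation step.
    """
    return [levels[i:i + cluster_size]
            for i in range(0, len(levels), cluster_size)]

def effective_speed(num_steps, step_time):
    """Target cycles per unit time, following the 1/(n*t) form in the text."""
    return 1.0 / (num_steps * step_time)

levels = list(string.ascii_uppercase[:18])   # levels 'A' through 'R'
cluster_size = 4                             # four processors per cluster

steps_before = len(levels)                   # one level per step
groups = group_levels(levels, cluster_size)
steps_after = len(groups)                    # equals ceil(18 / 4)

t = 1.0                                      # arbitrary unit; same t in both cases
speedup = effective_speed(steps_after, t) / effective_speed(steps_before, t)

print(steps_before, steps_after, groups[-1], round(speedup, 2))
```

Note that the last step carries only two levels (Q and R), so for this particular path the ratio is 18/5 rather than the full factor of four that a path whose length is a multiple of four would reach.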

Illustrating how different connections can be made for 
differing numbers of processors, FIG. 5 illustrates three 
methods of routing thirteen signals through four function 
evaluation units, with the total step time in each case equal 
to the same sum D1+D2+D3+D4.
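
As a rough illustration of one cascaded emulation step, the following Python sketch models a cluster whose evaluation units are lookup tables that read from a shared stack. It is not the routing of FIG. 5, which is not reproduced here. The four-input unit width, the tuple layout `(lut, addrs, use_cascade)`, and the names `make_lut`, `lut_eval`, and `cascade_step` are all assumptions made for the example. Under the four-input assumption, a pure chain of four units reads 4+3+3+3 = 13 signals from the stack.

```python
from itertools import product

WIDTH = 4  # assumed number of inputs per evaluation unit

def make_lut(fn):
    """Tabulate a Boolean function of WIDTH bits as a lookup table."""
    return [int(fn(*bits)) for bits in product((0, 1), repeat=WIDTH)]

def lut_eval(lut, bits):
    """Look up the output bit; the first input is the most significant."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return lut[index]

def cascade_step(stack, units):
    """Run one emulation step for a cluster sharing `stack`.

    units: list of (lut, addrs, use_cascade). When use_cascade is True the
    previous unit's output becomes the first input and `addrs` supplies
    the remaining WIDTH-1 operands.
    """
    # Setup phase: all units fetch their operands from the shared stack.
    fetched = [[stack[a] for a in addrs] for _lut, addrs, _casc in units]
    # Evaluation phase: outputs ripple from one unit to the next.
    outputs, prev = [], None
    for (lut, _addrs, use_cascade), operands in zip(units, fetched):
        if use_cascade and prev is None:
            raise ValueError("first unit has no cascade input")
        bits = [prev] + operands if use_cascade else operands
        if len(bits) != WIDTH:
            raise ValueError("each unit needs exactly WIDTH inputs")
        prev = lut_eval(lut, bits)
        outputs.append(prev)
    # Storage phase: all results are written back together.
    return stack + outputs, outputs

and4 = make_lut(lambda a, b, c, d: a & b & c & d)
or4 = make_lut(lambda a, b, c, d: a | b | c | d)
xor4 = make_lut(lambda a, b, c, d: a ^ b ^ c ^ d)

stack = [1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0]   # thirteen input signals
units = [
    (and4, [0, 1, 2, 3], False),
    (or4, [4, 5, 6], True),
    (xor4, [7, 8, 9], True),
    (or4, [10, 11, 12], True),
]
new_stack, outputs = cascade_step(stack, units)
print(outputs)
```

Setting `use_cascade` to False for a unit makes it read all of its operands from the shared stack instead, which corresponds to the option described in the text of routing an output to the stack rather than to the next unit.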

While the preferred embodiment of the invention has been
described, it will be understood that those skilled in the art, 
both now and in the future, may make various improvements 
and enhancements which fall within the scope of the claims 
which follow. These claims should be construed to maintain
the proper protection for the invention first described. 

What is claimed is: 

1. A method for use in a software-driven multiprocessor 
emulation system, wherein there are a plurality of emulation 
processors, each emulation processor containing an execution
unit for processing multiple types of logic gate
functions, with means provided for each emulation proces- 
sor to switch from one logic gate function to a next logic gate 
function in a switched-emulation sequence of different gate 
functions, and wherein the switched-emulation sequence of
each of the plurality of processors enables emulating a 
subset of gates in a hardware arrangement in which logic 
gates are of any type that the emulation processors func- 
tionally represent for a sequence of clock cycles, and 
wherein a plurality of multiplexors having outputs respectively
connected to the emulation processors and having
inputs respectively connected to each of the other emulation 
processors, and bus means connected to the multiplexors to 
enable an output from any emulation processor to be transferred
to an input of any other of the emulation processors,
an embedded control store in each of the emulation proces- 
sors to store software logic-representing signals for control- 
ling operations of the emulation processor, an embedded 
data store in each of the emulation processors to receive 
input for, and data generated by the same emulation processor
under control of software signals stored in the embedded
control store in the same emulation processor, and bus
controls to transmit data on the bus from any emulation 
processor through a connected multiplexor under control of 
software signals stored in the embedded control store to
control computational emulation of the hardware arrange- 
ment by operation of the plurality of emulation processors 
under software control, comprising the steps of: 

on an integrated circuit, arranging said plurality of emulation
processors to form a plurality of clusters of
processors such that the setup and storing of results 
calculated by each of said plurality of emulation pro- 
cessors within each of said plurality of clusters is done 
in parallel, but the output of one of said emulation 
processors within each of said plurality of clusters is
made available as the input of said emulation proces- 
sors within each of said plurality of clusters via a set of 
'cascade' connections to intermediate values used in 
the emulation engine by tapping said intermediate 
values from one emulation processor within one of said
plurality of clusters, and feeding said intermediate 
values to the next emulation processor within said one 
of said plurality of clusters via said cascade connections
to exploit a time differential between data flow
where time is measured as the time taken to perform a 
single emulation step. 

2. The method according to claim 1, wherein for each 
emulation step, an emulation processor reads a logic func- 
tion and associated operands from the data store, performs 
the operation, and writes the results with an effective speed 
of the emulation, measured in cycles per unit time, given as
1/(n*t), where n is the number of logic levels emulated and
t is the time measured as the time taken to perform a single 
emulation step, and multiple logic functions are evaluated 
effectively in parallel at the same time t. 

3. The method according to claim 2, wherein for each 
cluster, a single emulation step includes reading from the 
data store (which contains both input data and generated 
data), setting up the operation, performing the evaluation, 
and storing the results as an intermediate value, and feeding 
said intermediate value from one processor to the next 
processor of a cluster, such that the setup and storing of 
results is done in parallel with the output of one emulation 
processor connected to the input of the next emulation 
processor with said cascade connection providing access to 
the intermediate values. 

4. The method according to claim 3, wherein signals flow 
through all emulation processors within a cluster, and 
wherein multiple logic levels are distributed among said 
emulation processors within a cluster for evaluation as a 
single step, with the same emulation processors within a 
cluster being employed for evaluation of additional logic 
levels in additional steps employed for other logic levels 
with n evaluation steps, where n is the number of logic levels 
evaluated. 

5. A method for use in a software-driven multiprocessor 
emulation system, wherein there are a plurality of emulation 
processors, each emulation processor containing an execu- 
tion unit for processing multiple types of logic gate 
functions, with means provided for each emulation proces- 
sor to switch from one logic gate function to a next logic gate 
function in a switched-emulation sequence of different gate
functions, and wherein the switched-emulation sequence of
each of the plurality of processors enables emulating a 
subset of gates in a hardware arrangement in which logic 
gates are of any type that the emulation processors func- 
tionally represent for a sequence of clock cycles, and 
wherein a plurality of multiplexors having outputs respec- 
tively connected to the emulation processors and having 
inputs respectively connected to each of the other emulation 
processors, and bus means connected to the multiplexors to 
enable an output from any emulation processor to be trans- 
ferred to an input of any other of the emulation processors, 
an embedded control store in each of the emulation proces- 
sors to store software logic-representing signals for control- 
ling operations of the emulation processor, an embedded 
data store in each of the emulation processors to receive 
input for, and data generated by the same emulation pro- 
cessor under control of software signals stored in the embed- 
ded control store in the same emulation processor, and bus 
controls to transmit data on the bus from any emulation 
processor through a connected multiplexor under control of 
software signals stored in the embedded control store to
control computational emulation of the hardware arrange- 
ment by operation of the plurality of emulation processors 
under software control, comprising the steps of: 

on an integrated circuit, arranging said plurality of emu- 
lation processors to form a plurality of clustered 
processors, each of said plurality of clustered processors
sharing one input and data stacks, said clustered
processors providing an emulation engine such that the 
setup and storing of results is done in parallel, but the 
output of one of said plurality of emulation processors 
is made available as the input of a next one of said 
plurality of emulation processors via a set of 'cascade'
connections to intermediate values used in the emula- 
tion engine by tapping said intermediate values from 
one of said emulation processors within one of said 
clustered processors, and feeding said intermediate 
values to the next one of said emulation processors 
within one of said clustered processors via said cascade 
connections to exploit a time differential between data 
flow where time is measured as the time taken to 
perform a single emulation step. 

6. The method according to claim 5, wherein for each 
emulation step, an emulation processor reads a logic func- 
tion and associated operands from the data store of any 
emulation processor within the clustered processor, per- 
forms the operation, and writes the results with an effective 
speed of the emulation, measured in cycles per unit time, 
given as 1/(n*t), where n is the number of logic levels
emulated and t is the time measured as the time taken to 
perform a single emulation step, and multiple logic functions
are evaluated effectively in parallel at the same time t.

7. The method according to claim 6, wherein for each 
emulation engine, a single emulation step includes reading 
from the data store (which contains both input data and
generated data) of any emulation processor within the clus- 
tered processor, setting up the operation, performing the 
evaluation, and storing the results as an intermediate value, 
and feeding the intermediate value from one emulation 
processor to the next emulation processor of a clustered 
processor when emulation processors within a clustered 
processor are interconnected with a cascade connection, 
such that the setup and storing of results is done in parallel, 
with the output of one emulation processor connected to the 
input of the next emulation processor with said cascade 
connection providing access to the intermediate values. 

8. The method according to claim 7, wherein the clustered
processors share input and data stacks, arranged with
the signal flowing through all emulation processors within 
said clustered processors, and wherein multiple logic levels 
are distributed among said emulation processors within a 
clustered processor for evaluation as a single step, with the
same clustered processors being employed for evaluation of 
additional logic levels in additional steps employed for other 
logic levels with n evaluation steps, where n is the number
of logic levels evaluated.

9. An integrated circuit used in a processor-based system 
for emulating logic designs comprised of combinatorial and 
sequential logic gates, comprising: 

a plurality of input and data stack structures; 

a plurality of clusters of emulation processors, each of 
said plurality of clusters of emulation processors com- 
prising a plurality of emulation processors, each of said 
plurality of emulation processors comprising an execu- 
tion unit that sequentially evaluates the combinatorial 
logic functions; 

an interconnection network for interconnecting outputs 
from each of said plurality of emulation processors to 
inputs on any other of said plurality of emulation 
processors; and 
each of said plurality of clusters of emulation processors
being associated with a corresponding one of said 
plurality of input and data stack structures such that 
outputs from said corresponding one of said plurality of 
input and data stack structures are provided to each of 
said plurality of processors within one of said plurality 
of clusters of emulation processors and outputs from 
each of said plurality of emulation processors within 
said one of said plurality of clusters of emulation 
processors are input to said corresponding one of said 
plurality of input and data stack structures. 

10. The integrated circuit of claim 9 further comprising a 
plurality of cascade connections, each of said plurality of 
cascade connections placing a first of said plurality of 
emulation processors within one of said plurality of clusters 
of emulation processors in substantially direct electrical 
communication with all others of said plurality of emulation 
processors within said one of said plurality of clusters of 
emulation processors. 

11. The integrated circuit of claim 9 wherein one of said 
plurality of emulation processors within each of said plu- 
rality of clusters of emulation processors can have its output 
directed to either a subsequent one of said plurality of 
emulation processors within each of said plurality of clusters 
of emulation processors through a cascade connection or to 
said corresponding one of said plurality of input and data 
stack structures. 

12. An integrated circuit used in a processor-based system 
for emulating logic designs comprised of combinatorial and 
sequential logic, comprising: 

a plurality of emulation processors, each of said plurality 
of emulation processors comprising an execution unit 
that sequentially evaluates the combinatorial logic 
functions, wherein said plurality of emulation proces- 
sors are arranged as a plurality of clusters of emulation 
processors; 

an interconnection network for interconnecting outputs 
from each of said plurality of emulation processors to 
inputs on any other of said plurality of emulation 
processors; and 

a plurality of cascade connections, each of said plurality 
of cascade connections making outputs from said 
execution unit available to said execution unit within a 
subsequent one of said plurality of emulation proces- 
sors within one of said plurality of clusters of emulation 
processors. 

13. The integrated circuit of claim 12 further comprising 
a plurality of input and data stack structures; 

each of said plurality of clusters of emulation processors 
sharing a corresponding one of said plurality of input 
and data stack structures such that outputs from said 
corresponding one of said plurality of input and data 
stack structures are provided to each of said plurality of 
processors within one of said plurality of clusters of 
emulation processors and outputs from each of said 
plurality of emulation processors within said one of 
said plurality of clusters of emulation processors are 
input to said corresponding one of said plurality of 
input and data stack structures. 